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Abstract 

From a high volume stream of weighted items, we want to maintain a generic sample of a certain 
limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the clas- 
sic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present a 
reservoir sampling scheme providing variance optimal estimation of subset sums. More precisely, if we 
have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes 
the average variance over all subsets of size m. In fact, the optimality is against any off-line sampling 
scheme tailored for the concrete set of items seen: no off-line scheme based on k samples can perform 
better than our on-line scheme when it comes to average variance over any subset size. 

Our scheme has no positive covariances between any pair of item estimates. Also, our scheme can 
handle each new item of the stream in 0(log k) time, which is optimal even on the word RAM. 

Categories and Subject Descriptors C.2.3 [Computer-Communications Networks]: Network 
Operations-Afefwofi; monitoring E.l [Data]: Data Structures F.2 [Theory of Computation]: Analysis of 
Algorithms and Problem Complexity G.3 [Mathematics of Computing]: Probability and Statistics H.3 
[Information Systems] : Information Storage and Retrieval 

General Terms Measurement, Theory 

Key words Subset sum estimation, weighted sampling, sampling without replacement, reservoir sampling. 

1 Introduction 

In this paper we focus on sampling from a high volume stream of weighted items. The items arrive faster 
and in larger quantities than can be saved, so only a sample can be stored efficiently. We want to maintain a 
generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. 
In ifTTTl this is the basic function used in a data base system for streams. Applied to Internet traffic analysis, 
the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow 
records from different time intervals of a worm attack whose signature is later determined. The samples 
taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the 
time of sampling. 
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1.1 Reservoir sampling with unbiased estimation 

The problem we consider is classically known as reservoir sampling Ifl2l pp. 138-140]. In reservoir sam- 
pling, we process a stream of (weighted) items. The items arrive one at the time, and a reservoir maintains 
a sample S of the items seen thus far. When a new item arrives, it may be included in the sample S and old 
items may be dropped from S. Old items outside S are never reconsidered. 

In this paper, we think of estimation as an integral part of sampling. Ultimately, we want to use a sample 
to estimate the total weight of any subset of the items seen so far. Fixing notation, we are dealing with a 
stream of items where item % has a positive weight For some integer capacity k > 1, we maintain a 
reservoir S with capacity for at most k samples from the items seen thus far. Let [n] = {1, . . . , n} be the set 
of items seen. Each item i € [n] has a weight estimate wi which is if i S. We require these estimators 
to be unbiased in the sense that E[wi] = Wi. For any subset / C [n], we let wi and wj denote Yliti w i an( ^ 
Y^iei respectively. By linearity of expectation E[wi] = wj. Since all unsampled items have estimates, 
we get wms = wj. Thus wms is an unbiased estimator of wj and it only uses items from the sample. 
Reservoir sampling thus addresses two issues: 

• The streaming issue [13 ] where with limited memory we want to compute a sample from a huge 
stream that passes by only once. 

• The incremental data structure issue of maintaining a sample as new weighted items are inserted. In 
our case, we use the sample to provide quick estimates of sums over arbitrary subsets of the items 
seen thus far. 

1.2 Off-line sampling 

We will compare the above on-line reservoir sampling scheme with an arbitrary off-line sampling scheme 
which gets the n weighted items up front, and can tailor the sampling and estimation freely to this concrete 
set, not having to worry about efficiency or the arrival of more items. The only restriction being the bound 
k on the number of samples. More abstractly, the off-line sampling scheme is an arbitrary probability 
distribution Q, over functions w : [n] — ► R from items i to weight estimates Wi which is unbiased in the 
sense that Ew^n[w~i] = Wi, and which has at most k non-zeros. 

1.3 Variance optimality 

When n items have arrived, for each subset size m < n, we consider the average variance for subsets of size 

m < n: 



We present a reservoir sampling scheme which is variance optimal in the following strong sense. For each 
reservoir size k, stream prefix of n weighted items, and subset size m, there is no off-line sampling scheme 
with k samples getting a smaller average variance Vm than our generic reservoir sampling scheme. 
The average variance measure V m was introduced in ll20l where it was proved that 





(1) 



2 



for any sampling and estimation scheme. Here T,V is the sum of individual variances while VT, is the 
variance of the estimate of the total, that is, 



It follows that we minimize V m for all m if and only if we simultaneously minimize EV and VE, which is 
what our scheme does. The optimal value for VE is 0, meaning that the estimate of the total is exact. 

Let W p denote the expected variance of a random subset including each item i independently with some 
probability p. It is also shown in [20] that W p = p ((1 — p)T>V + pVTi). So if we simultaneously minimize 
EV and VE, we also minimize W p . 

Note that with our optimal VE = 0, by (Q]), we have V m = — ^V- F° r contrast, if some scheme has 
zero covariances, it has VE = EV, and then V m = ~ EV. Hence, with EV fixed, the difference between 
these two cases is a factor for subsets of size m. Similarly, for Wi the variance when VE = is 
smaller by a factor of 2 than the variance when VE = EV. 

With no information given about which kind of subsets are to be estimated, it makes most sense to 
optimize average variance measures like those above giving each item equal opportunity to be included in 
the estimated subset. The scheme that we propose is the first reservoir sampling scheme that minimizes the 
sum of individual variances EV and in fact, it minimizes the average variance V m for all subset sizes m. 

1.4 Negative covariances wherever possible 

Our scheme has no positive covariance between any pair of item estimates. In fact our scheme has negative 
covariance between the estimates of any pair of items each picked with probability below 1. The importance 
of this feature was advocated in [3]. It complements the average-over-subsets variance optimality measures 
with good guarantees for particular subsets: With non-positive covariances, the variance of a subset / C [n] 
is bounded by Var({Dj) = Var(^ ieJ wi) < X^e/ ^ar^). When (some of) the covariances between items 
in I are even negative then the variance of the sum is strictly smaller than the sum of the variances of the 
individual items. So the quality of the estimate improves as the covariances get smaller. Since we have 
VE = it follows that if we examine all pairwise covariances over [n] then our negative covariances cancel 
out the positive variances of the individual items. 

1.5 Efficient for each item 

Also, our sampling and estimation scheme can handle each new item of the stream in 0(log k) worst-case 
time. In a realistic implementation with floating point numbers, we have some precision p and accept 
an error of 2~ p . We have a matching lower bound of f2(log k) on the word RAM for any floating point 
implementation of a reservoir sampling scheme with capacity for k samples which minimizes EV. This 
lower bound is proved in Appendix |B] which would fit in a camera ready version. 

1.6 Known sampling schemes 

We will now discuss known sampling schemes in relation to the qualities of our new proposed scheme: 

• Average variance optimality for any subset size. 

• No positive covariances between items. 




and 
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• Efficient reservoir sampling implementation with capacity for at most k samples. 

While many prior sampling schemes share some of these qualities, they all perform significantly worse on 
others. For example, many sampling schemes from statistics, such as Sunter's method lfT8H . are not suitable 
in our reservoir sampling context because they need to sort the items before deciding which ones to sample. 
We note that the statistics literature is filled with sampling schemes with many different features, and here 
we can only discuss a select few schemes that we consider most relavant in our context. 

In our discussion, we are particularly interested in heavy-tailed distributions (such as power-low distri- 
butions), that frequently occur in practice. In these distributions, a small fraction of dominant items accounts 
for a large fraction of the total weight lfT][T4l. 

Most of the sampling schemes we consider use the standard Horvitz-Thompson estimator iflOll where if 
the probability that item i is sampled is pi, then the weight estimate if sampled is Wi/pi. This estimator is 
unbiased and gives the smallest possible variance for item i among unbiased estimators with this sampling 
probability. 

Uniform sampling without replacement In uniform sampling without replacement, we pick a sample 
of k items uniformly at random. If item i is sampled it gets the Horvitz-Thompson weight estimate Wi = 
win /k. Uniform sampling has obvious variance problems with heavy-tailed distribution because it is likely 
to miss the dominant items. Uniform reservoir sampling has been known since 1962 (§]. Item % > k is 
included with probability k/i, and if so, we drop a random one of the previously sampled items. Thus each 
item is processed in constant time. A subconstant solution by Vitter ll22l does not consider all items, but 
generates directly the random number of items to be skipped before reaching an item to be included in the 
sample. 

Probability proportional to size sampling with replacement (ppswr) In probability proportional to 
size sampling (pps) with replacement (wr), each sample Sj € [n], j € [k], is independent, and equal to i 
with probability wi/wy n y Then i is sampled if i = Sj for some j € [k]. This happens with probability 
Pi = 1 — (1 — Wi/w[ n ]) k , and if i is sampled, it gets the Horvitz-Thompson estimator Wi = Wi/pi. 

ppswr does not work well with heavy-tailed distributions: if few dominant items contain most of the total 
weight, then most samples will be copies of these dominant items. As a result, we are left with comparatively 
few samples of the remaining items. 

We note that there are many estimators for ppswr. A counting based estimator for item i is 
k 3 — —w\ n y This counting based estimator has larger variance for the individual items than the above 
Horvitz-Thompson estimator. However, the counting based estimator does get the total exact, so by £T|), it 
does better on the average for sufficiently large subsets. 

In the case of integer weights, another variant is to divide them into unit weights and then use uniform 
sampling without replacement on these units. The estimate of an item is then the sum of the estimates of its 
units. However, when the total weight is large compared with the number of samples, then this alternative 
is very similar to the previous one. None of these alternatives alleviate the basic problem that most samples 
are likely devoted to a few dominant items, when such items exists. 

A reservoir version of ppswr is suggested in [2j. It uses 0(k) time per item. If a new item dominating 
the preceding items arrives, it is likely going to be used in many of the k samples, so O(k) time is best 
possible for any explicit maintenance of the sample. 
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Probability proportional to size sampling without replacement (ppswor) Another classic scheme, and 
an obvious improvement to ppswr is to do it without replacement (ppswor). Here each new item is chosen 
with probability proportional to size among the items not yet in the sample. 

We can implement ppswor by assigning to each item an independent rank value that is exponentially 
distributed with parameter that is equal to the weight of the item Equivalently, generate an independent 
uniformly random ai G (0, 1), and a rank value qi = — In cn/wi. Assuming that all rank values are distinct, 
the ppswor sample of size k consists of the k items of smallest rank values. 

Ppswor implemented this way is a special case of bottom-k sketches introduced in (3] 01. These sketches 
are defined by assigning independent rank values to the items, where the rank of item i is drawn from 
a distribution that depends on wi. The bottom- k sketch then consists of the k items of smallest ranks. 
We easily implement bottom- A; sampling with a reservoir. The reservoir is maintained in a priority queue 
containing the k items with smallest rank values. When a item i > k arrives, we add it to the priority queue 
and remove the item with highest rank value. Using floating point numbers on the word RAM, we can 
implement a priority queue in O (log log k) time per update |2TI (Interestingly, that is exponentially better 
than the Q, (log k) lower bound we present for any scheme minimizing Y,V.) 

With ppswor, unlike ppswr, the probability that an item is included in the sample is a complicated 
function of all the item weights, and therefore the Horvitz-Thompson estimator is not directly applicable. 
Unbiased estimators for ppswor were recently suggested in (3JIH. The subset conditioning ppswor estimator 
has negative covariances (whenever possible) and has VS = 0. The caveat, however, is that Ty is not 
minimized. In Appendix |A] which would fit in a camera ready version, we prove a negative result for any 
ppswor estimator: We construct an instance for any sample size k and number of items n such that any 
estimation based on up to k + (In k)/2 ppswor samples will perform a factor f2(log k) worse on the average 
variances than our optimal scheme with only k samples. This is the first such negative result for ppswor 
besides the fact that it is not strictly optimal. 

Threshold sampling and inclussion probability proportional to size (ipps) The threshold sampling 
from [6] is a kind of Poisson sampling. In Poisson sampling, each item i is picked independently for S 
with some probability pi. We use the Horvitz-Thompson estimator Wi = u>i/pi if the item is picked. The 
expected number of samples is E[|5|] = YliPi- Since each item is picked individually, all covariances are 
zero so VY, = T.V > 0. 

In threshold sampling we pick a fixed threshold t. We include in the sample S every item whose weight 
is larger than r. An item i whose weight W{ is smaller than r is sampled with probability pi = W{/t. 

We guarantee a sample of expected size k by using the unique threshold such that Yj%Pi = 
min{l, tOj/rfe} is equal to k. A sampled item i G S has estimate Wi = Wi/pi = to*/ min{l, Wi/r^} = 
m&x{wi,T k }. 

In the special case where all weights are smaller than wt n i/k we get = w< n -\/k and pi = Wi/r^ for 
all i. Such sampling probabilities are often refered to as inclussion probability proportional to size (ipps) 
and it is standard that ipps minimizes the total variance among all Poisson sampling schemes with the same 
expected number of samples ifTTl p. 86]. However, in this paper large weights are considered important and 
not a special case. 

In it is argued that threshold sampling with any threshold is at least as good for T,V as any off-line 
scheme with the same expected number of samples. 

Lemma 1 Among off-line sampling schemes using an expected number of at most k < n samples, we 
minimize T,V if and only if each item follows the same marginal distribution as threshold sampling, that is, 
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item i is sampled with probability pi = min{l, Wi/rk] where Tk is chosen so that ^pi = k. Large items 
with Wi > Tk get estimate W{ = W{, and those with smaller weight get estimate Tk if picked. 

In our streaming context, we have the immediate problem that we do not know in advance which threshold 
to use so that the expected number of samples is k. Hence we do not know the sampling probability when the 
item arrives. In Q, it is suggested to use a dynamic threshold r as follows. For each item i = 0, .., n — 1, we 
generate an independent uniformly distributed random G (0, 1). Let the priority of i be qi = Wi/ai. Item 
i is included if qi > r which holds if and only if oti < Wi/r. This happens with probability min{l, w/t} 
exactly as above. In Q it then shown that r can be increased on-line as items arrive, dropping items as the 
threshold passes their priority, so as to maintain a reservoir with an expected number of exactly k samples. 
However, in this paper, we assume that we have allocated resources only for a fixed number of samples, and 
we would have to use significantly smaller expected number of samples to be sure that we stay within the 
bound. 

Priority sampling Priority sampling was introduced in [7] as a threshold style scheme which is tailored 
for reservoir sampling with k as a hard capacity constraint. For each item i = 0, .., n — 1, we again generate 
an independent uniformly random a>i G (0,1), and a priority qi = Wi/cti. Assuming that all priorities 
are distinct, the priority sample S of size k < n consists of the k items of highest priority. An associated 
threshold r is the (k + 1) largest priority. Then i G S 1 <^=^ qi> r. Each sampled item i G S gets a weight 
estimate wi = max{u;j, r}. If i £ S, Wi = 0. Note that all the estimates interact via the random variable r 
which depends on all the random priorities. It is proved in [7] that the priority estimator is unbiased. It is 
also proved that there is zero covariance between estimators, hence VT, = T,V as for threshold sampling. 
Moreover it is proved in lfl9l that priority sampling with k + 1 samples has smaller T,V than threshold 
sampling with an expected number of k samples. Allowing for one extra priority sample as opposed to an 
expected number of k threshold samples is much better when we have a hard capacity constraint on the 
number of samples. A priority sample is a bottom- A; sketch using the ranks ai/wi so it can be efficiently 
implemented using a priority queue, spending 0(log log k) time per item. 

Systematic threshold sampling We now turn to an off-line sampling scheme that minimizes the average 
variance V m for all subset sizes m, but which is hopeless for reservoir sampling. We consider the general 
version of systematic sampling where each item i has an individual sampling probability pi, and if picked, 
a weight estimate of Wi/pi. Contrasting Poisson sampling, the sampling decisions are not independent. 
Instead we pick a single uniformly random number r G (0, 1), and include i in S if and only if for some 
integer j, we have 

^2 Pi <j+r< J2 Pi ' 

h<i h<i 

It is not hard to see that Pv[i G S] = pi. Let k = J2ie[n] Pi ^ e me expected number of samples. Then the 
actual number of samples is either [k\ or \k~\ . In particular, this number is fixed if k is an integer. Below 
we assume that k is integer. 

In systematic threshold sampling, we will assume exactly the same sampling probabilities and estimates 
as in threshold sampling with an expected number of k samples. Therefore we have the same optimal sum 
of individual variances T,V. In fact, we also get VT, = 0. Although not realized in EDI , the proof that 
VT, = with systematic threshold sampling is really a proof of the following general lemma: 

Lemma 2 Among off-line sampling schemes using an expected number of at most k samples, we minimize 
the average variance V m for all subset sizes m if and only if the following conditions are satisfied: 
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• we use the marginal item distributions of threshold sampling, as described in Lemma\T\ and 

• we always perform exactly k samples. 

Thus we have two main goals coinciding: getting a version of threshold sampling that always produces 
exactly k samples, and minimizing the average variance for any subset size. Systematic threshold sampling 
does exactly this. There are, however, some very serious problems with systematic threshold sampling, 
particularly in relation to reservoir sampling. One is that when new items arrive, we have to increase 
the threshold so as to keep the number of samples fixed. However, with systematic threshold sampling, 
a small change in threshold may completely change the set of items sampled so as to including many 
previously discarded items. One possibility is to always change the threshold by a factor 2, essentially 
dropping every other item, but that is not an efficient usage of sampling resources, and we would loose 
optimality. Systematic threshold sampling is therefore pretty useless in a reservoir sampling context. 

Another serious objection to systematic threshold sampling in a streaming context is that we may have a 
very strong positive correlations between items in a subset depending on how they are placed in the stream. 
Hence some subsets may be subject to a huge variance. As an extreme case, if we have n unit items where 
n is even, and we want to sample n/2 items, then systematic sampling will either sample all the even, or all 
the odd items. This means that we have the maximal possible covariance between items of the same parity. 
Normally, it is therefore recommended that the items are shuffled before sampling ifPTl p. 92], but that is not 
possible with reservoir sampling. Even shuffling, however, is not enough to prevent positive covariances for 
certain combinations of weights. 

Our new scheme Like systematic threshold sampling, our new scheme provides optimal average variance 
for any subset size m, but it fixes the above mentioned problems in connection with a stream. Our new 
scheme works for reservoir sampling, maintaining a reservoir of size k that at any time is variance optimal 
with respect to the stream seen so far. Moreover, we have no positive covariances. Finally our scheme is 
efficient in the sense that it only spends 0(log k) time on each new item. 

1.7 A very lucky culmination 

Our new scheme can be seen as a very lucky culmination of previous works. The work on threshold @ and 
priority sampling [7] was focused on the sum of individual variances EV. Threshold sampling minimized 
EV but provided only an expected bound k on the size of the reservoir. Priority sampling got a hard capacity 
bound of k, and as proved in |fT9l it nearly minimized EV in the sense that it was optimal modulo one extra 
sample. We were thus essentially done with EV or equivalently, the average variance V\ of singleton sets. 

However, the importance of negative co- variances for larger subsets was promoted in (3l |20l both as a 
theoretical objective, and as something proving significant in experiments. For average performance, the 
best we can hope for is to get VE = 0. We tried but failed to find estimators for priority sampling giving 
VE = 0. 

However, it follows conveniently from Lemma [2] that if we respect a hard capacity bound of k and 
simultaneously minimize EV exactly, then we automatically get VE = 0, and then by (Q]), we minimize the 
average variance V m for any subset size m. This nice relation only holds with the hard capacity bound k 
and with exact minimization of VE. For contrast, when we minimize EV with threshold sampling using an 
expected number of samples or with priority sampling using one extra sample, we have EV = VE. This 
costs a factor 2 in expected variance for uniformly random subsets. 

Now optimizing EV exactly with at most k samples could be done with systematic threshold sampling, 
but this had two caveats. One was that it could not be implemented in a reservoir setting. The other was 
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that we could have huge positive co-variances. It turns out in our solution that if we want to minimize Y,V 
exactly in a reservoir sampling scheme, then there is essentially only one unique solution, and that solution 
cannot have any positive covariances. 

Thus, as our understanding of the natural goals of reservoir sampling from a stream of weighted items 
developed, whenever we tried to achieve one goal, we ended up getting one more for free, ending with a 
scheme with stronger properties than we would have expected to achieve by any single scheme. 

2 A variance optimal reservoir sampling scheme 

We now describe our new variance optimal sampling scheme, though leaving a more efficient implementa- 
tion till later. We maintain a reservoir R of at most k adjusted weights wi that are unbiased estimators of the 
original weights. Moreover, the total is exact, that is wr = toyi when j items have arrived. The first k items 
are accumulated unchanged in the reservoir with wi = W{. 

When item j > k arrives, we first add it unchanged to the reservoir, setting R pm = R U {j} with 
Wj = Wj. Trivially this preserves unbiased estimation and an exact total, but now we have a reservoir R pre 
with k + 1 items. 

Using a sampling and estimation scheme satisfying the conditions of Lemma [2l we create a sample S 
with k out of the k + 1 items in the i? pre . We then set R = S, and use the estimates from the sample as 
new adjusted weights. We refer to this as the subsampling step. To distinguish the state of the reservoir in 
different rounds we denote by Rj the reservoir after round j, and by i?J re = Rj-i U {j} the reservoir before 
round j. 

We know that we could implement the subsampling using systematic threshold sampling, but this 
reduction-by-one has a particularly simple form. With each of the k + 1 items i in R pre , Lemma [TJ as- 
sociates a sampling probability pi = mxa{l,Wi/rk} such that J2ieRp™Pi = ^- These are the sampling 
probabilities we would also have used for the systematic threshold sample. 

Note that the sum of the probabilities of not being in the sample is J2ieRp rc ^ ~ Pi) = \R pm \ — k = 1. 
We identify these unsample probabilities with disjoint segments of the unit interval, pick a random point in 
the unit interval, and drop the item of the segment containing that point. Each item survives with probability 
Pi, and we have exactly k survivors for the subsample S. The surviving items get the estimates max{u;i, r^} 
of LemmaQ] As described above, we make S the new reservoir R and use the estimates as our new adjusted 
weights. That is Wi = maxjuij, r^} if i € S = R. 

Since an unbiased estimator of an unbiased estimator is unbiased, it is clear that the adjusted weights 
remain unbiased estimators of all items processed. Also, by Lemma[2l the subsampling does not change the 
total, so the total of the adjusted weights remains the total of the weights seen so far. 

It remains to show that whenever we have reduced the reservoir to size k, the items in the reservoir 
follow the optimal marginal distributions of Lemma [TJ We know that we did so locally within the reservoir 
when reducing it from k + 1 to k items, and we need to show this local optimality implies global optimality. 

3 Analysis 

We now analyze the above scheme. We refer to the processing of item j as round j. Let t^j denote the 
threshold used in round j > k. We define = for i < k signifying that in the first k rounds, all items 
are included with probability 1 and their adjusted weights are equal to their weights. Let i?J re and Rj be the 
reservoir before and after the subsampling of round j. 
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Lemma 3 The thresholds T^k, T^k+i, T k,k+2, ■■■ T k,n are strictly increasing. 

Proof For j > k, we know that each of the k adjusted weights in Rj is of the form maxjwj, Tkj} > Tkj. 
Now the new weight Wj+\ = Wj+\ > enters, and we subsample with the new threshold Tkj+i- Suppose 
in contradiction to the lemma that t^j+i < Tkj. Then each of the k old adjusted weights would be as large 
as the new threshold, so they are included with probability 1. In addition, the new weight is included with 
positive probability equal to min{l, Wi/Tkj+i}, so we end up with an expected number of samples strictly 
bigger than k, contradiction that the expected number is exactly k. m 



Lemma 4 If i survives in the sample Rj, j > i, then it is with adjusted weight max{iOj, Tkj}. 

Proof By induction on the rounds. Let i < j — 1 be an item that exists in the reservoir after round j. 
By the induction hypothesis at the end of round j — 1, Wi = max{wi, Tkj-i}. By the definition of the 
subsampling step after round j, wi = max{u5j, Tf-j} = max-fiOj, 7>. Tf. j} which equals max{wi, T^j} 
by Lemma[3] By the definition of the subsampling step the statement also holds for item j if it survives. ■ 



Lemma 5 The multiset of values of adjusted weights after every subsampling step j is fixed regardless of 
the random choices, and so are the thresholds Tkj- 

Proof By induction on the rounds. The statement clearly holds for j < k. Consider round j > k. By 
induction the multiset of the adjusted weights of the items in R? m is fixed regardless of the random choices. 
Since t^j is determined by the multiset of the adjusted weights in i?J rc ! , it also does not depend on the 
random choices. Now, let L be the set of large items i £ i?? re with Wi > t&j. By Lemma|4]each item i G L 
survives in the new Rj with unchanged adjusted weight wi = W{. Also by Lemma [4] the remaining k — \L\ 
items i surviving in Rj have Wi < Tkj, so their adjusted weights are rjtj. It follows that the adjusted 
weights of items in Rj are fixed independent of the random choices. ■ 



Lemma 6 The overall probability that i survives in Rj, j > i, is min{l, Wi/r^j}. 

Proof If item i is in Rj then by Lemma |4] its adjusted weight is wi = max{iyj, tw}. Let pi be 
the probability that item i is indeed in Rj. Since wi is an unbiased estimator of Wi we must have 
Pi max{wj, Tkj} = Wi which holds iff pi = min{l, Wi/r^j}. ■ 

Lemma 7 The marginal item distributions when n items have arrived are those defined by Lemma\l\with 

1~k,n Tfe- 
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Proof With pi the probability that item i survives in R n , we know that J^Pi = k since \R n \ = k. Hence 
by Lemma [6j we have ^ mm{l, Wi/Tk,n} = k. However, we defined as the unique value such that 
Y^i min{l, Wi/Tk : n} = k, so we conclude that Tfc jn = Tfe. ■ 

Theorem 8 When n items have arrived, our reservoir R n of size k provides estimates minimizing the av- 
erage variance V m for subsets of size m compared with any off-line scheme allowed to sample an expected 
number of k samples. 

Proof Our procedure samples exactly k items, so the result follows from Lemma [2] together with Lemma 
We now turn to our other important property of non-positive covariances. 

Theorem 9 When n items have arrived, the estimates of each pair of items have non-positive covariances. 
That is 

E(wiWj) < E(wi)E(wj) . 
Furthermore if both items are included in the sample with probability < 1 then the covariance is negative. 

Proof Since each item always gets the same estimate when sampled, it suffices to prove that pij < PiPj, 
where p^j is the probability that both item i and item j are included. 

If pi = 1 then p^ = pj and therefore p^j = PiPj. Similarly, if pj = 1 then p^ = pi and therefore 
Pij = PiPj- So we assume for the rest of the proof that pi < 1 and pj < 1 and show that under these 
conditions p^ < PiPj. 

Let i < j. We will condition on item i ending in R n , and argue that this decreases the probability of 
item j ending in R n . More precisely, suppose item j has not been dropped before round h > j. We will 
prove that the probability that item j is dropped can only increase if we condition on item i surviving to R n . 
Moreover, we will argue that the increase is strict in the last round, hence that we have a strict increase in 
the overall probability that item j is dropped. 

From lemma [5] we know that all the thresholds are fixed. This implies that if we don't condition on 
item i surviving, then the probability that item £ € R^ c survives the subsample to R^ is the fixed value 
pe : h = min{l, W£ : h/Tk,h}- We now condition on item i surviving. This means that i is present and that we 
rule out that it is dropped which should normally happen with probability 1 — pi^. Since the dropping of 
different items from R^ c constitute disjoint events, we conclude that the probability of dropping any other 
present item is increased by a factor 1/p^h- I n particular, this means that if item j is present in R^ re , the 
probability of dropping j is now (1 — Pj t h) /Pi,h- This is a strict increase if Pi,h,Pj,h < L and that is always 
the case when h = n since we assumed that pi = pi >n < 1 and pj = pj >n < 1. ■ 



4 An efficient implementation 

We first show how to implement our sampling algorithm such that processing each item takes 0(log k) time. 
Later we show how to process each item in O(l) expected amortized time if the input stream is randomly 
permuted. 
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Consider round j > k. Our first goal is to identify the new threshold r = Tkj > Tfcj-i- Then we 
subsample k out of the k + 1 items in i?J rc = Rj-i U {j}. Let itf(i), ...jW^+l) t> e me adjusted weights of 
the items in R? re in sorted order, breaking ties arbitrarily. We first identify the largest number t such that 
wm < t. Here 

w {t) < t <^ k + 1 - t + (^2 )/*"(*) ^ k (E Mix))/™® >t-l. (3) 

x<t x<t 

After finding t we find r as the solution to 

(E *w)/ r = * - 1 ^ r = (E - x ) • < 4) 

To find the item to leave out, we pick a uniformly random number r € (0, 1), and find the smallest d < t 
such that 

E^ 1 ~ *w/ r ) - r dr - E - rT ■ ^ 

x<d x<d 

Then the dth smallest item in -Rj re , is the one we drop to create the sample S = Rj. 

The equations above suggests that we find t, r, and d by a binary search. When we consider an item 
during this search we need to know the number of items of smaller adjusted weight, and their total adjusted 
weight. 

To perform this binary search we represent Rj-i divided into two sets. The set L of large items with 
Wi > an d Wi = Wi, and the set T = Rj-i \ L of small items whose adjusted weight is equal to the 

threshold t^j-i- We represent L in sorted order by a balanced binary search tree. Each node in this tree 
stores the number of items in its subtree and their total weight. We represent T in sorted order (here in fact 
the order could be arbitrary) by a balanced binary search tree, where each node in this tree stores the number 
of items in its subtree. If we multiply the number of items in a subtree of T by Tkj-i we get their total 
adjusted weight. 

The height of each of these two trees is 0(log k) so we can insert or delete an element, or concatenate or 
split a list in 0(log k) time [5 ]. Furthermore, if we follow a path down from the root of one of these trees to 
a node v, then by accumulating counters from roots of subtrees hanging to the left of the path, and smaller 
nodes on the path, we can maintain the number of items in the tree smaller than the one at v, and the total 
adjusted weight of these items. 

We process item j as follows. If item j is large, that is Wj > T^j-i, we insert it into the tree representing 
L. Then we find t by searching the tree over L as follows. While at a node v we compute the total number 
of items smaller than the one at v by adding to the number of such items in L, \T\ or \T\ + 1 depending upon 
whether Wj < t^j-i or not. Similarly, we compute the total adjusted weight of items smaller than the one 
at v by adding \T\rf. j_x to the total weight of such items L, and Wj if Wj < T),j-\. Then we use Equation 
© to decide if t is the index of the item at v, or we should proceed to the left or to the right child of v. After 
computing t we compute r by Equation ©. Next we identify d by first considering item j if Wj < Tfc 
and then searching either the tree over T or the tree over L in a way similar to the search for computing t but 
using Equation ((51). Once finding d our subsample becomes Rj = S = i?J ro \ {d}. All this takes 0(log k). 

Last we update our representation of the reservoir, so that it corresponds to Rj and ,-. We insert Wj into 
T if Wj < Tfcj-i (otherwise it had already been inserted into L). We also delete d from the list containing 
it. If W( t j was a large weight we split L at wu) and concatenate the prefix of L to T. Our balanced trees 
support concatenation and split in 0(log k) time, so this does not affect our overall time bounds. Thus we 
have proved the following theorem. 
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Theorem 10 With the above implementation, our reservoir sampling algorithm processes each new item in 
0(log k) time. 

In the above implementation we have assumed constant time access to real numbers including the random 
r G (0, 1). Real computers do not support real reals, so in practice we would suggest using floating point 
numbers with precision p S> logra, accepting a fractional error of order 1/2 P . 

4.1 Typical simple cases 

We call the processing of a new item simple if it is not selected for the reservoir and if the threshold does 
not increase above any of the previous large weights. We will argue that the simple case is dominating if 
n^> k, and in section l4~2l we get a substantial speed-up by reducing the processing time of the simple case 
to a constant. 

Lemma [6] implies that our reservoir sampling scheme satisfies the condition of the following simple 
lemma: 

Lemma 11 Consider a reservoir sampling scheme with capacity k such that when any stream prefix I has 
passed by, the probability that i £ / is in the current reservoir is independent of the order of I. If a stream 
of n items is randomly permuted, then the expected number of times that the newest item is included in the 
reservoir is bounded by k(\n(n/k) + 0(1)). 

Proof Consider any prefix I of the stream. The average probability that an item i € / is in the reservoir 
i? is < k/\I\. If I is randomly permuted, then this is the expected probability that the last item of I 

is in R. By linearity of expectation, we get that the expected number of times the newest item is included in 
R is bounded by k + £™ =fc+1 k/j = fc(l + H n - H k+1 ) = k{\n{n/k) + O(l)). ■ 

As an easy consequence, we get 

Lemma 12 When we apply our reservoir sampling algorithm to a randomly permuted stream, the expected 
number of times that the threshold passes a weight in the reservoir is bounded by k(\n(n/k) + 0(1))- 

Proof Since the threshold is increasing, a weight in the reservoir can only be passed once, and we 
know from Lemma \TT\ that the expected number of weights ever entering the reservoir is bounded by 
kQn(n/k)+0(l)). m 



4.2 Constant time processing of simple cases 

We now show how to perform a simple case in constant time. To do so, we maintain the smallest of the large 
weights in the reservoir in a variable wg. 

We now start the processing of item j, hoping for it to be a simple case. We assume we know the 
cardinality of the set T of items in Rj-% with weight no higher than Tk,j-i- Tentatively as in (0]) we compute 

r = (wj + |T|r fc j_i)/|r|. 

If Wj > t or t > we, we cannot be in the simple case, so we revert to the original implementation. 
Otherwise, r has its correct value, and we proceed to generate the random number r € (0, 1) from the 
original algorithm. If 

(r — Wj) > rr, 
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we would include the new item, so we revert to the original algorithm using this value of r. Otherwise, 
we skip item j setting t^j = t. No further processing is required, so we are done in constant time. The 
reservoir and its division into large and small items is unchanged. 

Theorem 13 A randomly permuted stream of length n is processed in 0{n + fc(log A;) (log n)) time. 

Proof We spend only constant time in the simple cases. From Lemma ITTI and [T2l we get that the expected 
number of non-simple cases is at most 2k(ln(n/k) + 0(1)) = 0(A;(log(n/A;)), and we spend only 0(log£;) 
time in these cases. ■ 



4.3 Simpler amortized implementation 

We will present a simpler implementation of our scheme based on two standard priority queues. This version 
will also handle the above simple cases in constant time. From a worst-case perspective this will not be as 
good because we may spend 0{k log log k) time on processing a single item, but on the other hand, it is 
guaranteed to process any sequence of k items within this time bound. Thus the amortized cost per item is 
0(log log k), which is exponentially better than the previous 0(log k) worst-case bound. 

The simple idea is to use a priority queue for the set L of large items, that is, items whose weight exceeds 
the current threshold r. The priorities of the large items are just their weight. The priority queue provides us 
the lightest large item £ from L in constant time. Assuming integer or floating point representation, we can 
update L in 0(loglog/c) time [21]. The items in T are maintained in an initial segement of an array with 
capacity for k items. 

We now consider the arrival of a new item j with weight wj, and let Tj-\ denote the current threshold. 
All items in T have adjusted weight Tj_i while all other weight have no adjustements to their weights. 

We are building a set S with items outside T that we know are smaller than the upcoming threshold 
Tj > Tj-i. To start with, if Wj < r 3 -_i, we set S = {j}; otherwise we set S = and add item j to L. 

We are going to move items from L to S until L only contains items bigger than the upcoming threshold 
Tj. For that purpose, we will maintain the sum W of adjusted weights in S U T. The sum over T is known 
as Tj_i|T| to which we add Wj if S = {j}. 

The priority queue over L provides us with the lightest item i in L. From (0) we know that I should be 
moved to S if and only if 

(W + w e )/wi > \S\ + \T\. (6) 

If © is satisfied, we delete I from L and insert it in S while adding we to W. We repeat these moves until 
L is empty or we get a contradiction to ©. 
We can now compute the new theshold Tj as 

Tj = W/(\S\ + \T\). 

Our remaining task is to find an item to be deleted based a uniformly random number r G (0,1). If the total 
weight ws in S is such that \S\ — ws/tj < r, we delete an item from S as follows. With S represented as 
an array. Incrementing i starting from 1, we stop as soon as we get a value such that i — Ws\i..i}/ r i ^ r > an d 
then we delete S[i — 1] from S, replacing it by the last item from S in the array. 

If we do not delete an item from S, just delete a uniformly random item from T. Since T fills an initial 
segment of an array, we just generate a random number i G [\T\], and set T[i] = T[\T\). Now \T\ is one 
smaller. 
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Having discarded an item from S or T, we move all remaining items in S to the array of T, placing 
them behind the current items in T. All members of T have the new implicit adjusted weight tj. We are 
now done processing item j, ready for the next item to arrive. 

Theorem 14 The above implementation processes items in 0(log log k) time amortized time when averaged 
over k items. Simple cases are handled in constant time, and are not part of the above amortization. As a 
result, we process a randomly permuted stream of length n is in 0{n + A; (log log A;) (log n)) time. 

Proof First we argue that over k items, the number of priority queue updates for L is 0{k). Only new 
items are inserted in L and we started with at most k items in L, so the total number of updates is 0{k), 
and each of them take 0(log log k) time. The remaining cost of processing a given item j is a constant plus 
0{\S\) where S may include the new item j and items taken from L. We saw above that we could only take 
0{k) items from L over the processing of k items. 

In the simple cases, the new item j ends up being skipped in constant time without any changes to L, and 
hence they can be ignored from the above amortization. Finally, we derive the result for randomly permuted 
sequences as we derived Theorem [T3l but exploiting the better amortized time bound of O (log log k) for 
the non-simple cases. ■ 
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A Bad case for ppswor 

We will now provide a bad instance for probability proportional to size sampling without replacement 
(ppswor) sampling k out of n items. Even if ppswor is allowed k + (Ink)/ 2 samples, it will perform a 
factor fj(logfc) worse on the average variance for any subset size m than the optimal scheme with k sam- 
ples. Since the optimal scheme has VT, = 0, it suffices to prove the statement concerning EV. The negative 
result is independent of the ppswor estimator as long as unsampled items get estimate 0. The proof of this 
negative result is only sketched below. 

Let i = n — k — 1. Theinstance has k — 1 items of size i and t unit items. The optimal scheme will pick 
all the large items and one random unit item. Hence T.V is £(1 — l/£)£ < £ 2 ■ 

Now, with ppswor, there is some probability that a large item is not picked, and when that happens, it 
contributes I 2 to the variance. We will prove that with the first k ppswor samples, we waste approximately 
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In k samples on unit items, which are hence missing for the large items, and even if we get half that many 
extra samples, the variance contribution from missing large items is going to be £l(£ 2 log k). 

For the analysis, suppose we were going to sample all items with ppswor. Let Uj be the number of unit 
items we sample between the i — 1st and the ith large item. Each sample we get has a probability of almost 
(k — i)/(k — i + 1) of being large. We say almost because there may be less than £ remaining unit items. 
However, we want to show w.h.p. that close to In k unit items are sampled, so for a contradiction, we can 
assume that at least I — In k « £ unit items remain. As a result, the expected number of unit items in the 
interval is (A; — i + l)/(& — i) — 1 = l/(k — i). This means that by the time we get to the k — [mfc]th 
large item, the expected number of unit samples is X^Li* ~~ *) ~ Since we are adding almost 

independent random variables each of which is at most one, we have a sharp concentration, so by the time 
we have gotten to the k — [In k~\ large item, we have approximately In k unit samples with high probability. 

To get a formal proof using Chernoff bounds, for the number of unit items between large item i — 1 and 
i, we can use a pessimistic 0-1 random variable dominated be the above expected number. This variable is 
1 with probability l/(k — % + 1)(1 — lnk/£) which is less than the probability that the next item is small, 
and now we have independent variables for different rounds. 

B An f2(log k) time lower bound for variance optimal schemes 

Our above 0(log k) worst-case time implementation is elementary based on standard balanced search trees, 
but with priority sampling, we have an 0(log log k) RAM implementation based on RAM priority queues 
Ell . This assumes that we use floating point numbers with precision p 3> logn, accepting a fractional 
error of order l/2 p . Here we argue that we cannot use the RAM to get any corresponding improvement if 
we want to optimize SV. We know from Lemma Q] that if we want to minimize Y<V then we will have r 
among estimators, so we need to compute r which is the unique number satisfying 

min{l, Wj/r} = k <J=^ 'y^^{wj\i g R, Wj < r} = r(k — \{wj\i € R, Wj > t}|). 

Using 0(log nj log log n) dynamic rank based on atomic heaps by it is not hard to prove that if we 
could identify r in sublogarithmic time, then we would be able to solve prefix sum in sublogarithmic time, 
contradicting a RAM lower bound for the incremental case of O(logn) from lfT5l [I6l . This lower bound 
is for integer weights in words, but that can be simulated with floating point numbers with a corresponding 
precision. 
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